A Quick Guide to Benchmarking AI Models on Azure: ResNet with MLPerf Training v3.0

HugoAffaticati · ‎Jun 28 2023

By Hugo Affaticati (Technical Program Manager), Sonal Doomra (Technical Program Manager 2), and Jon Shelley (Principal TPM Manager).

Introduction

Azure is pleased to showcase results from our MLPerf Training v3.0 submission. For this submission, we benchmarked our ND H100 v5 virtual machine (preview), which features:

8x NVIDIA H100 Tensor Core GPUs interconnected via next-generation NVSwitch and NVLink 4.0
400 Gb/s NVIDIA Quantum-2 CX7 InfiniBand per GPU, for 3.2 Tb/s per VM, in a non-blocking fat-tree network
NVSwitch and NVLink 4.0 with 3.6 TB/s bisection bandwidth between the 8 local GPUs within each VM
4th Gen Intel Xeon Scalable processors
PCIe Gen5 host-to-GPU interconnect with 64 GB/s bandwidth per GPU
16 channels of 4800 MHz DDR5 DIMMs

Full results are available on the MLCommons website.

How to replicate the results in Azure

Prerequisites:

Deploy and set up an ND H100 v5 virtual machine on Azure using Azure Portal or Azure CycleCloud.
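Before going further, it can help to confirm that the VM actually exposes all eight GPUs. The sketch below assumes the NVIDIA driver and nvidia-smi are already installed on the VM image; gpu_count_matches is a hypothetical helper name used for illustration, not part of any Azure or NVIDIA tooling.

```shell
# Hypothetical helper: read a GPU listing on stdin (one GPU per line) and
# succeed only if the number of non-empty lines equals the expected count.
gpu_count_matches() {
  expected="$1"
  actual="$(awk 'NF { n++ } END { print n + 0 }')"
  echo "GPUs found: $actual (expected $expected)"
  [ "$actual" -eq "$expected" ]
}

# Usage on the VM (assumes the NVIDIA driver is installed):
#   nvidia-smi --query-gpu=name --format=csv,noheader | gpu_count_matches 8
```

Keeping the counting logic separate from the nvidia-smi call makes the check easy to reuse with any one-device-per-line listing.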

Set up the environment

First, download the container from NVIDIA NGC (an account is required). Then clone MLCommons' publicly available GitHub repository of MLPerf Training v3.0 results, which contains the Azure-specific code.

cd /share
docker pull nvcr.io/nvdlfwea/mlperfv30/resnet:20230428.mxnet
git clone https://github.com/mlcommons/training_results_v3.0.git
cd /share/training_results_v3.0/Azure/benchmarks/resnet/implementations/ND_H100_v5
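The commands above, and the dataset steps that follow, rely on a handful of tools being on the PATH. As a minimal sketch, the check below reports any that are missing; require_cmds is a hypothetical helper name, and the list of tools in the usage comment is simply the set of commands used in this guide.

```shell
# Hypothetical helper: succeed only if every named command is on PATH,
# printing each missing command to stderr.
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   require_cmds docker git wget tar find || echo "install the tools listed above"
# To confirm that the image was pulled:
#   docker image inspect nvcr.io/nvdlfwea/mlperfv30/resnet:20230428.mxnet >/dev/null
```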

Get the dataset for ResNet

The ResNet benchmark uses the ImageNet 2012 (ILSVRC2012) dataset. One will need both the Training images (Task 1 & 2) and the Validation images (all tasks) for MLPerf Training v3.0.
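The archives are large (roughly 138 GB for the training images and about 6 GB for the validation images), and extraction temporarily needs room for both an archive and its contents. A quick free-space check before downloading can save a failed run. This is a sketch under assumptions: has_free_gb is a hypothetical helper, and the 300 GB threshold in the usage comment is an illustrative margin, not a documented requirement.

```shell
# Hypothetical helper: succeed if the filesystem holding DIR has at least
# MIN_GB gigabytes available.
has_free_gb() {
  dir="$1"
  min_gb="$2"
  # df -Pk prints POSIX-format output in 1024-byte blocks;
  # the fourth column of the second line is the available space.
  avail_kb="$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')"
  [ "$((avail_kb / 1024 / 1024))" -ge "$min_gb" ]
}

# Usage (the 300 GB threshold is an assumption, adjust as needed):
#   has_free_gb /share 300 || echo "not enough free space under /share"
```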

For the Training images:

mkdir -p /share/data && cd /share/data
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar
mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train
tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
find . -name "*.tar" | while read -r NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
cd ..
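Once extraction finishes, the training set should contain one directory per class. ImageNet 2012 has 1,000 classes and 1,281,167 training images, so a quick count is a cheap sanity check. The helper names below are hypothetical, and the sketch assumes a find that supports -mindepth and -maxdepth (GNU and BSD find both do).

```shell
# Hypothetical helper: count the immediate subdirectories of DIR
# (one per ImageNet class after extraction).
count_class_dirs() {
  find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Hypothetical helper: count all .JPEG files anywhere under DIR.
count_jpegs() {
  find "$1" -type f -name '*.JPEG' | wc -l | tr -d ' '
}

# Usage:
#   count_class_dirs /share/data/train   # 1000 expected
#   count_jpegs /share/data/train        # 1281167 expected
```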

For the Validation images:

wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar
mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val
tar -xvf ILSVRC2012_img_val.tar && rm -f ILSVRC2012_img_val.tar
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
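The valprep.sh script moves each validation image into a directory named after its class, so no image should remain directly under val afterwards. As a minimal sketch, the hypothetical helper below counts images left at the top level, which is a simple way to spot a preparation step that was interrupted.

```shell
# Hypothetical helper: count .JPEG files sitting directly in DIR,
# ignoring anything inside its subdirectories.
count_loose_jpegs() {
  find "$1" -maxdepth 1 -type f -name '*.JPEG' | wc -l | tr -d ' '
}

# Usage:
#   count_loose_jpegs /share/data/val   # 0 expected once valprep.sh has finished
```

The validation set holds 50,000 images in total, spread across the same 1,000 class directories as the training set.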

Run the ResNet benchmark

Running the benchmark takes two steps: source the configuration file, then start the benchmark.

cd /share/training_results_v3.0/Azure/benchmarks/resnet/implementations/ND_H100_v5
source config_DGXH100.sh
CONT=nvcr.io/nvdlfwea/mlperfv30/resnet:20230428.mxnet DATADIR=/share/data LOGDIR=results ./run_with_docker.sh
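MLPerf measures time to train from the run_start event to the run_stop event, both of which are written to the log as :::MLLOG lines carrying a time_ms field. The sketch below extracts that interval from a single log file. The function name is hypothetical, and the log path in the usage comment is a placeholder, since the exact file names written under LOGDIR are not given here.

```shell
# Hypothetical helper: print the minutes elapsed between the run_start and
# run_stop events of an MLPerf log. If an event appears more than once,
# the last occurrence is used. Fails if either event is missing.
mllog_minutes() {
  awk '
    /:::MLLOG/ && /"key": "run_start"/ {
      # "time_ms": plus the following space is 11 characters; digits follow.
      if (match($0, /"time_ms": [0-9]+/))
        start = substr($0, RSTART + 11, RLENGTH - 11)
    }
    /:::MLLOG/ && /"key": "run_stop"/ {
      if (match($0, /"time_ms": [0-9]+/))
        stop = substr($0, RSTART + 11, RLENGTH - 11)
    }
    END {
      if (start == "" || stop == "") exit 1
      printf "%.2f\n", (stop - start) / 60000
    }
  ' "$1"
}

# Usage (placeholder path, substitute the log file produced by your run):
#   mllog_minutes results/<your-log-file>.log
```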

The same steps apply to the other MLPerf Training v3.0 benchmarks, using each benchmark's own configuration file and data preprocessing steps.

#AzureHPCAI #MakeAIYourReality
